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Abstract 

We consider the problem of clustering functional data while jointly selecting the most relevant 
features for classification. This problem has never been tackled before in the functional data 
context, and it requires a proper definition of the concept of sparsity for functional data. 
Functional sparse clustering is here analytically defined as a variational problem with a hard 
thresholding constraint ensuring the sparsity of the solution. First, a unique solution to 
sparse clustering with hard thresholding in finite dimensions is proved to exist. Then, the 
infinite dimensional generalization is given and proved to have a unique solution. Both the 
multivariate and the functional version of sparse clustering with hard thresholding exhibit 
improvements on other standard and sparse clustering strategies on simulated data. A real 
functional data application is also shown. Key words — Sparse Clustering, Functional 
Data, Weighted Distance, Variational Problem 



1 Introduction 


In a clustering problem, it is unlikely that the real underlying groups differ in all the features 
considered. In most situations, only a limited number of features is relevant to detect the 
clusters. This is a known fact, part of the folk knowledge on cluster analysis dealing with 
vector data in M p , and it further complicates the analysis as much as the number of features p 
is larger than the sample size N. We want to consider, here, methods that are able to cluster 
data while also selecting their most relevant features. Such clustering methods are called 
sparse. Sparse clustering has, at least, three advantages: firstly, if only a small number of 
features separates the clusters, it might result in a more accurate identification of the groups 
when compared with standard clustering. Moreover, it helps the interpretation of the final 
grouping. Finally, it reduces the dimensionality of the problem. 

Sparse clustering methods for multivariate data have already been proposed. When di¬ 
mensional reduction is the major focus, a common approach to non-parametric classification 


is based on Principal Component Analysis (Gosh and Chinnaiyan, 2002; Liu et al. 2003). 


However, the use of PCA does not necessary lead to a sparse solution, often not even efficient, 
since principal components are usually linear combinations of all the features considered, i.e. 
very few loadings are zero. Moreover, there is no guarantee that the reduced space identi¬ 
fied via PCA contains the signal that one is interested in detecting via clustering (see the 


study perfomed in Chang (1983)). Indeed sparse PCA clustering methods for vector data 


have been recently proposed, see for instance Luss and d’Aspremont (2010) and references 
therein. 

Model-based approaches to clustering have also been extensively studied in recent years, 
and they generally assume data to be generated from a mixture of Gaussian distributions 


with K components (Fraley and Raftery, 2002 McLachlan and Peel, 2000). In the past 


decade it has been realized that the performance of model-based clustering procedures can 
be degraded if irrelevant or noisy variables are present. Hence, many efforts have been 


made to develop variable selection for model-based clustering: the proposal by Raftery and 
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Dean (2006), and its improvements proposed by Maugis et al. (2009afb), can be considered 


as the current state of the art on variable selection methods in model-based clustering. 


However, even if these approaches seem promising (Celeux et al. 2013), none of them has 
been generalized to the functional data framework. 

A different approach, called regularization , consists in enforcing the solution of the clus¬ 
tering problem to be sparse by a suitable penalization. For instance, the penalty can be 


added to the maximization of the log-likelihood (Pan and Shen 2007; Wang and Zhu 2008 


Xie et al. 2008). Friedman and Meulman (2004) proposed a completely different strategy, 


named Clustering Objects on Subsets of Attributes (COSA), which can be seen as a weighted 
version of K-means clustering where different feature weights within each cluster are allowed. 
However, this proposal does not correspond to a truly sparse method, since all variables have 


non-zero weights for positive values of the regularization parameter A. Witten and Tibshi- 


rani 


(2010) introduced a framework for regularization-based feature selection in clustering, 


which makes use of a Lasso type penalty. This corresponds to a soft thresholding operator, 
since it results in sparsity just for small values of the tuning parameter s defining the L 1 
Witten and Tibshirani (2010) decline their proposal both in the K-means case, 


constraint. 


and for hierarchical clustering procedures, also proposing a nice strategy to tune the sparsity 
parameter s on the basis of a GAP statistics. 

We aim here at developing a method for sparse clustering of functional data. Except for 
proposals for clustering sparsely observed curves (James and Sugar, 2003), the only contri¬ 


bution in this direction is Chen et al. (2014), where a weighted L 2 distance among functional 
data is proposed to improve the performance of various statistical methods for functional 
data, including clustering. Even though the optimality criterion is quite different, we follow 
the same direction, and we propose a generalization of sparse clustering to functional data 
which is based on the estimation of a suitable weighting function w(-), capable of detecting 
the portions of the curves domain which are the most relevant to the classification pur¬ 
poses. The generalization is then natural if we define clustering as a variational problem, 
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having w(-) and the data grouping as solutions, and enforce sparsity via a proper penalty 
or constraint on w(-). Note that variable selection in model-based clustering requires strong 
modelling assumptions, which are often too restrictive in the functional context; this is the 
main reason why a regularization approach is preferred in this context. However, differently 
from Witten and Tibshirani (2010), the constraint ensuring the sparsity of the solution is 
a hard thersholding operator, meaning that the solution w(-) has to be non-zero only on a 
portion of the domain. As a by product of this formulation, we also derive a new formulation 
for the sparse clustering problem in finite dimensions, with the hard thersholding operator. 
In this sense, our proposal in finite-dimensions is a limiting case of the lasso-based proposal 


by Witten and Tibshirani (2010). 

The rest of the paper is organized as follows. In the next section, we briefly review the 
sparse clustering framework proposed in Witten and Tibshirani (2010) for finite dimensional 
data, and we recall their main result with the aim of helping the reader in noticing the 
analogies and differences with our approach. We introduce our hard thresholding proposal to 
sparse clustering in the finite dimensional setting, and we state the existence and uniqueness 
of the solution to this problem. In Section [3j after defining the theoretical setting and 
focussing on the meaning of features selection in infinite dimensions, we state the variational 
problem defining sparse clustering for functional data, and prove the existence and uniqueness 
of its solution. In Section [4] we test our method on synthetic data, while also comparing it 
to other multivariate and functional clustering approaches. An application to the analysis 
of the Berkeley Growth Study data, a benchmark in functional data analysis, is described 
in Section [5j We conclude the paper in Section [6] with a discussion, where we also give 
a perspective on future developments. All simulations and analysis of real datasets are 


performed in R (R Development Core Team 2011). 
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2 Multivariate Sparse Clustering 

Let X G M Arxp be the data matrix, and let us indicate with Xj, j — 1,... ,p the j-th column 
of X, including the values of the j-th feature for all data in the sample. Many clustering 
methods can be expressed as a maximization problem of the form 


p 

max ^g j (X j ]C 1 ,...,C K ), (1) 

ci,...,c k —, 

3 =1 


where gj(X.j', Cj,..., Ck) is some function that involves only the j-th feature of the data, 
and Cj,..., Ck is a partition of the N data into K disjoint sets. K-means clustering lies 
in this general framework, with (jj(-) being the between clusters sum of squares for feature 

j- 


Witten and Tibshirani (2010) define sparse clustering as the solution to the following 


maximization problem 


™ ax r . . r b( x b C, • • •, C K ), 


weRP;Ci,...,C K 

subject to 11 w11£ 2 < 1, ||wll^ < s, Wj > 0 for j = 1,... ,p. 


j = 1 
12 


( 2 ) 


The weighting vector w G M p has the role of feature selector, achieved by the lasso-type 
constraint on the tj norm. The other constraints are obvious: w must have at most unitary 
Euclidean norm, and it must be non-negative. The tuning parameter s is specified by outer 
information. 

The proposal in Witten and Tibshirani (2010) is to tackle the optimization problem (|2]) 
via an iterative algorithm, that alternatively seeks the maximum in w by holding Cj,..., Ck 
fixed, or finds the best partition Cj,..., Ck by holding the weighting vector w fixed, until 
no significant variation in the weight vector w is observed. Convergence to a global optimum 
is not achieved in general, but we are guaranteed to increase the objective function at each 
iteration. When w is fixed, the best partition is found by using a standard clustering 
procedure to a weighted version of the data. When the aim is optimizing (J2]) with respect 
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to w by holding C t ,..., Ck fixed, the problem can be written as 


max w'a, (3) 

weR p 

subject to 11w11f < 1, 11w11< s, Wj > 0 for j — 1,... ,p, 


where a 6 M p is such that its j-tli component is a 3 = gj(X.j ; Ci ,..., Ck)- For x G M and 
c > 0, define S(x,c ) = sign(x)(|a:| — c)+ to be the soft thresholding operator. For ease 
of notation we allow for the operator S to be applied component-wise to vectors of M p . 


Then, the following result holds, directly from Karush-Kuhn-Tucker conditions (Boyd and 


Vandenberghe (2004)) 


Proposition 1 (Witten and Tibshirani (2010)). 

For 1 < s < y/p, a solution to problem |5j) is 


S(a+, A) 

||5'(a+, A )||^2 ’ 


where x + denotes the positive part of x G M and A = 0 if that results in ||ro ||4 < s; otherwise, 
A > 0 is chosen to yield = s. 

Instead of enforcing sparsity with a lasso-type constraint on the norm of w, our pro¬ 
posal consists in forcing a certain number of components of w to be exactly zero, i.e. re¬ 
formulate the same problem with respect to a counting measure. This would mean defining 
multivariate sparse clustering as the solution to the following maximization problem 

p 

wCl. ■ ■ ■ ■ Ck), (4) 

3 = 1 

subject to 11w11f <1, Wj > 0 for j — 1,... ,p and p\{j : Wj = 0}) = m, 


where /r indicates the counting measure on = {l,...,p}. The weighting vector w G R p 
(feature selector) is now forced to have exactly m components equal to zero, where m is 


5 










an integer such that 0 < m < p. There is a relation between the parameter s in ([Tj) and 
the parameter m we are introducing here, since for small values of s some components will 
indeed be null. However, the solution of problem (|4]) is a hard-thresholding solution, which 
the soft-thresholding solution found by optimizing ([2]) can recover only for some limiting 
values of s. It will be shown via simulation studies that this hard thresholding strategy 
reduces the misclassification error in a high-dimensional (non necessarily functional) setting. 

The optimization problem Q can be tackled by an iterative algorithm. Hence, we only 
need to consider the following constrained optimization problem 


max w'b (5) 

weRp 

subject to 11w11 ^2 <1, wj > 0 for j — 1,... ,p and pfi{{j : Wj = 0}) = m , 


where b = (bi,...,b p )' is a vector of M p , and we assume that the components of b are all 
greater than zero and with no ties, i.e. bi > 0 and bi ^ bj, for i , j = 1 i ^ j. The 

vector b has the role of describing the goodness of the partition with respect to each of 
the features. In order to represent the solution of problem (jsj) , let fcp) be the i-th ordered 
component of the vector b and set B = {i E Q : b t > 6( m )}, letting B — if m — 0, and 
B = 0 if m = p. 

Theorem 1. 

Problem 0 has a unique solution w* e M p whose components are: 


(£* 6 b &?) 3 


r, ifieB, 


wj = 


0, 


otherwise. 


Proof. Consider the function /( w) = Yl P j=i w jbj defined on the domain D = {w G M p : 
11w11 £2 <1, Wj > 0 for j — 1,... ,p and fJ({j : Wj = 0}) = m}. The function / is continuous 
on D and D is a compact subset of R p , being the finite union of closed and bounded sets. 
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Therefore problem (J5]) has a solution. Let w be a solution of problem (J5]) and set A = {j : 
Wj > 0}. The set A and the set B coincide. Indeed the two sets have the same number of 
elements. If they were not the same set, there would exist an % 6 A \ B and a j G B \ A such 
that, by switching the value u>; with the value Wj , one could construct a w e M. p (satisfying 
the constraints of problem ([5]) ) such that /(w) > /(w), against the assumption that w is a 
solution of problem (|5]). In fact, set: 


( 




Then ||w ||^2 < 1, Wj > 0 for j 
v 

= 

3 = 1 

< 


Wj 

= Wi, 





Wi 

= W 31 




(6) 

Wk 

= w k , 

for k G hi \ {i, j}. 



= 1, 

... ,p and : Wj 

= o}) = 

m, while 



E 

w k b k + Wjbj 

+ wA 




.-.P}\{*>3'} 






E 

w k b k + Wibj 

+ 

$ 

P* 

II 

P 

(7) 

&e{i. 

.-.P>\{*>3'} 



3=1 



Thus A and B coincide. Therefore, if w is a solution of (j5]), 


v 

3=1 


= ^2 W 3 b 3 

jeA 


- 

j&B 

< 


j&B j£B 


j&B 


where the first inequality is Cauchy-Schwarz inequality and the second follows from the fact 
that ||w (|^2 < 1. Take w* e with components 





if i E B, 


0, otherwise. 
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Then w* satisfies the constraints of problem (5) and Yl^=i w j^j = Hence, w* is 

a solution of problem ([5]). 

Let w be another solution of problem (J5|. Indicate with b* the vector of M p with com¬ 
ponents b* = bi if i G B, 0 otherwise, and with 6 the angle between w and b*. Then 
008 ( 6 *) = = || dj| — , where the second equality is true because Y2j=i^jbj — 

£?=i w jbj = (Ejestf) 2 — |b*||^ 2 , being w a solution of (pi). Since ||w ||^2 < 1, and 
cos( 6 *) = 1 / 11w11 £2 , then ||w ||^2 = 1. Hence cos( 6 *) = 1, and 


b* 



proving that the solution of problem ([5]) is unique. □ 

We remark that the uniqueness of the solution given by Theorem [T| depends on the 
hypothesis of having no ties in the vector b, because ties make the sorting of b not unique. 
However, this assumption seems reasonable in practice, being b the vector describing cluster 
separation (e.g., the Between-Clusters Sum of Squares (BOSS) in case of K-means) with 
respect to each feature. 

Having found the unique optimal weighting vector, given by Theorem [TJ we are guaran¬ 
teed that also the solution to the finite dimensional sparse clustering problem Q is unique, 
since the number of possible partitions of the data is finite. However, this theoretical guar¬ 
antee does not solve the problem in practice, since finding the maximum over the set of all 
possible partitions is a combinatorial optimization problem, for which a complete enumera¬ 
tion search over all possible groupings of the data is computationally impractical in appli¬ 
cations. A large number of heuristic search strategies have been proposed in the literature 
for ordinary weighted clustering based on criteria of the form ([I]). Hence, the optimization 
problem ([4]) can be tackled by an iterative algorithm where the clustering step is specified 
via a proper search strategy algorithm. Details on the proposed procedure are given in the 
next subsection. The algorithm performance will be tested in Section [4] on synthetic data. 









2.1 A K-means implementation of Multivariate Sparse Clustering 


We are now going to set np an algorithm which is suitable for finding the solution to problems 
of the form Q, where the functions ^(X ? ; C'i, ..., Ck) correspond to the BOSS with respect 
to the 7 -th feature and in the i 2 distance 


9i(Xj;C I ,...,C K ) = F £E 


N N I< i 

V Xi 'j) ~ 

i= 1 i'=1 k= 1 J ii'eCfe 


1C. 


Xi'j ) , 


( 8 ) 


and where W is the number of elements belonging to the fc-th cluster. Hence, the criterion 
used in our algorithm assigns higher weights to features that can cause a higher increase in 
the BOSS, i.e. to features with respect to which clusters can be distinguished the most. 

It is well-known from the literature on cluster analysis that the iwneans algorithm 


(Pollard, 1981 Hartigan, 1975, 1978) is an optimal search strategy with respect to criterion 


(JdJ), when (J8| is chosen. Hence, we implement the following Multivariate Sparse K-means 
Clustering Algorithm : 


1. initialize C \,..., Ck by running a simple multivariate A'-means; 

2. iterate until convergence: 

(i) holding C'i,..., Ck fixed, use Theorem [I] to obtain the optimal weighting vector 
w*, where the vector b = {pi,... ,h p ) is such that bj = QjiKp, Ci, ..., Ck) in (|8]) 
for each j = 1,... ,p; 

(ii) holding w = w* fixed, find the optimal partition C'i,..., C K by optimizing 
criterion Q with BOSS, i.e., run a A'-means on the weighted data obtained 
by considering the N x N dissimilarity matrix whose {i,i') element is diy = 

Y7j=i Wj{xij - Xi'j) 2 ] 

3. stop when no more changes in the partition C'i,..., Ck are observed at two subsequent 
iterations. 
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The optimal weighting vector is given by the w* obtained at the last (i) step, while the final 
clusters are given by C'i,..., Ck in the last (ii) step. 

A quite relevant issue remains open at this point: how to select the number m of features 
that we expect to have associated weight Wj null, for a given application? In simulation 
studies and real applications, m can be tuned by using a permutation approach based on 
the computation of a GAP statistics, that has been proved to be quite successful in Witten 


and Tibshirani (2010) for tuning parameter s, and in Tibshirani et al. (2001) for choosing 


the number of clusters K. 


3 Functional Sparse Clustering 


We now focus on the same problem when data lie in an infinite dimensional space. Let 
(Z?,W, /i) be a measure space, with g the Lebesgue measure, and D C M. compact. Let 
/i,..., /y : D —* M be a functional data set, with /* e L 2 (D ) for all i = 1,..., N, which is 
a reasonable assumption in most applications. We further assume that a proper functional 
form has been reconstructed for each datum in a preprocessing step of the analysis, thanks 


to some smoothing technique (see Ramsay and Silverman (2005) and references therein). 


We finally assume that, if a phase variability is present, data have already been aligned (see 
Section [6] for a discussion on this issue). 

Likewise the finite dimensional case, many clustering methods for functional data can be 
written as a variational problem of the form 


max / g(f 1 (x),...,f N (x)]C 1 ,...,C K )dx, (9) 

K J£) 

where for ease of notation we write dx instead of dg(x). The function g(-) depends on the 
chosen clustering strategy; for instance, the standard K-means algorithm for functional data 
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looks for a partition C ±,..., Ck of the data labels that maximizes the BOSS 


g(fi(x), • • •, /ffWi Ci,, C*) = 


1 

2N 


N 

£ 

M=1 




- / t ^)) 2 - X 


/i=l 


1 

I2C4 


XI (M X ) ~ fj( X )) 2 ■ 


( 10 ) 


We first need to give a meaning to the concept of feature selection for functional data, getting 
inspired by the finite-dimensional case. When data are elements of R p , feature selection can 
be carried out by an optimal weighting vector w e R p . The weighting vector w can be 
analytically computed according to Proposition [T| in the context of soft thresholding, and 
to Theorem [T| in the context of hard thresholding. The latter approach, which we suggest, 
consists in finding the subset S of the set of p features that discriminate the most between 
clusters. This idea can be adapted to functional data, provided we change the counting 
measure on the finite dimensional set of features with a proper measure fi on the continuum 
features index set D. In the functional setting the role of the feature selector can be thus 
played by a weighting function w : D —>• M. We define functional sparse clustering as 
the solution to a variational constrained optimization problem, where the constraints on w 
enforce the sparsity of the optimal solution 


max 

w£L 2 (D)-,C 1 ,...C k 

subject to: 


w(x)g(f 1 (x ),..., f N (x)] Ci,..., C K )clx, 


( 11 ) 


Id 

w(x) | \l 2 (d) < 1, w(x) > 0 p — a.e. and /i({x G D : w(x) = 0}) > m. 


The weighting function vj is non-negative almost everywhere on D, and it belongs to the 
closed unitary ball of L 2 (D), so that the optimization problem is well-posed. The sparsity 
of the method is ensured by the fact that w must be zero on a set of measure m. As for 
the weighting vector in finite dimensions, the role of w is to select the relevant features for 
clustering: if a Borel set B C D is not relevant for cluster identification, we should expect 
w(x ) = 0 for x E B, while if curves belonging to different clusters differ greatly in B , then 
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w should be strictly positive on that subset, with each value w(x) reflecting the importance 
of x G D for partitioning the data. 

As for the finite dimensional case, we first focus on the problem of optimizing ( JTT| with 
respect to w by holding C'i ,..., C K fixed. Problem ( [II] ) can be tackled via an iterative 
algorithm, that alternatively computes the optimal w by holding Ci,..., Ck fixed, or finds 
the best partition Cj,..., C K by applying a suitable functional clustering technique, where 
the distance among functions is weighted according to the optimal w. We will give details on 


this clustering procedure in Section 3.1 We now focus on the following variational problem 


max / w(x)b(x)dx , 

w&L 2 (D) J D 

subject to: \\w(x)\\l^(d) < 1, w(x) > 0 p — a.e. and /i({x G D : w(x) = 0}) > 


( 12 ) 


m. 


where 0 < m < n(D) < oo, and b(x) is continuous and non-negative a.e. Since D is compact, 
b G L°°(D). The function b{x) depends on the data and on the partition C'i,... Ck , and 
it defines how well the partition discriminates among clusters. Our main result states the 


existence and uniqueness of the solution to problem (12). 


Theorem 2. 


There exists a solution to problem (12) given by w(x) = where Ib(%) is the 


indicator function of the set B — {x G D : b{x) > k}, for a suitable k > 0. Moreover the 
solution is [pi]-a.e. unique if the function <f>(t) := p({x G D : b{x) < i}) is continuous. 


Proof. Let {b n (x )} n£ n be a monotone increasing sequence of simple functions b n (x) = 
Y^i=i c ?Ib"(x) : such that b n (x) converges to b{x) in L°°(D). The collection of sets {B 7 f}'f =l 
forms a partition of D, i.e. Bf fl Bf = 0 V i ^ j and (J” =i Bf - = D. Fix k G M + such 
that /u({x G D : b(x) > k}) = n(D) — m. This is always possible thanks to the continuity 
of 4>(t). Then, the set Aj. — {x G D : b(x) < k} is such that pi{Ajf) = m. Thanks to the 
continuity of b(x) and the finiteness of D , A *. is also a compact set. Let be the sub- 
collection of indices of the sets {Bf[f =1 such that the intersection with Ak is non empty 
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H% := {h G {1,..., n} : A k fl ^ 0}, and let the number of elements in HJ! be m k . Then, 
define B:= U heH^B^. We have: 

1. b(x) > k,Vx G D \ B%; 

2. A k C B k , and thus n{B k ) > n(A k ) = m; 

3. c\ < k,Wh G H£. 

Consider w, b” G M n , and the following finite-dimensional optimization problem: 

max (w, b n ) (13) 


s. t. ||w ||^2 <1, Wi > 0 Vi = 1,..., n and fi\{i : Wi = 0}) > m k , 

where b n = (c™,..., c”)'. According to Theorem [TJ the solution is given by the vector 
w* ,n = (wl’ n , ..., ru*’" - )', where: 


-a-T if i e X”, 

(X ie i ? (cD 2 )2 

k 


*,n 

w/ = < 


(14) 


0 otherwise, 


with X^ = {1,... ,n} \ H^. It is always possible to select the collection { B ™}" =1 such that 


there are no ties in b n , and thus the solution in (14) is unique for every n. Now, consider 


the sequence of functions {w n (a;)} rigN defined as w n (x) = Y^i=\ w i’ n ^Rl l ( x )- We claim that 
w n {x) —» w{x) uniformly, where w(x) = — Ib(%) and B = {x E D : b(x) > k} . 

Indeed, because of the uniform convergence of b n (x) to b(x) and the boundedness of D, we 
have that 


c r / sr( x ) ^ b(x)i B (x) 


(15) 


iexr 
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and \\b n (x)\\ LP -> \\b(x)\\ LP , Vp > 1. Then 


\w n (x) -w(x)\\ L co < ii 


(Eiex«( c D 2 ) 5 




\\K x )\\l 2 (b) 


Ibv- (x) 


E 


f \\K x )\\l 2 (b) 


lB n {x) ~ 


b(x) 


\\K x )\\l 2 (b) 


Ib{x) 


L°° • 


(16) 


As n —>■ oo, the second term in 
have: 


(16) vanishes because of (15), whereas for the first term we 


E 


(E 




I)2 ) 


-Ibv-{x) 


-E 


\\K X )\\l 2 {B)' 


(%) ||l°° < 


max5(a;) 

x&D 


|| 6 (x)|| L 2 -|| 6 n (a ;)|| L 2 ^ ma x xeD b(x) 

||6 fl (x)|| La ||6(x)|| La 1 - ||6 1 (x)||| 2 


|||^)IU 2 -||W|| L 2|, 


which tends to 0, when n —>■ oo, given that &i(x) 7 ^ 0. If b\{x) = 0, then we take b- 2 (x) in the 
last inequality and the result still holds. It is easy to check that w(x) satisfies the norm and 
the non-negativity constraints. The constraint on the measure of the sparsity set is implied 
by the uniform convergence and the maximality of w(x) follows from the continuity of the 
limit. The uniqueness a.e. of the solution is obvious. □ 


Since the proof makes use of Theorem [l] in the simple function approximation, we do not 
encounter the same issues about the uniqueness of the solution which we were facing in the 
finite dimensional case. Indeed, if the function b(x) is constant on an interval of positive 
measure, it is sufficient to keep the partition {5 "}” =1 fixed V n on the interval where b(x) 
is constant, and let it vary on the rest. We remark that an alternative proof of this result, 
which completely relies on variational theory, is reported in the Appendix. 


3.1 A K-means implementation of Functional Sparse Clustering 

We here propose a possible iterative algorithm implementing the functional sparse clustering 
framework described so far. In particular, we aim at solving the variational problem in (JTT|, 


where g(-) is specified by (10), and thus the solution w : D —> M quantifies the increase in 
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the BOSS that each portion of the domain D could generate. Note that we have to maximize 
over the infinite dimensional set of weighting functions satisfying the constraints, and jointly 
on the (finite, but possibly huge) set of all possible cluster assignments. 

To this aim, we propose the following AT-mean inspired iterative strategy, that we name 
Functional Sparse K-means Clustering Algorithm: 


1 . 

2 . 


initialize Ci ,..., Ck by running functional A'-means (Tarpey and Kinateder 2003); 


iterate until convergence: 


(i) holding C'i ,..., Ck fixed, use Theorem [2] to obtain the optimal weighting function 
w*{x ); 

(ii) holding w = w* fixed, find the optimal partition C'i,..., Ck by optimizing crite¬ 
rion ( fllj ) with BCSS, i.e., run a functional A'-means algorithm according to the 
weighted measure 


d w (fi,fi')= [ w(x)(fi(x) - fi'{x)) 2 dx; (17) 

J D 

3. stop when there are no more changes in the partition. 

The optimal weighting function is given by the w* obtained at the last (i) step, while the final 
clusters are given by C'i,..., Ck in the last (ii) step. As we already noted in the multivariate 
case, this kind of procedure only assures that the value of the objective functional is increased 
at each step, but it does not assure that a global optimum is achieved. We also remark 
that the same permutation-based approach for tuning m on the basis of a GAP statistics, 
proposed for the finite dimensional setting, can be applied here: the domain D is subdivided 
into a possibly large set of sub-domains, and the procedure is then applied to the observed 
functions after they have been randomly permuted within each sub-domain. The number of 
sub-domains depends obviously on the available computing power and on the data dimension. 
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4 Simulation Studies 


The previously described setting for sparse clustering is here tested on synthetic data in 
various situations. 

4.1 A simulation study for multivariate sparse clustering 

Aim of the present simulation study is to compare results of our /b-means implementation 
of multivariate sparse clustering with results given by standard A'-means and by the sparse 


AT-means proposed in Witten and Tibshirani (2010). 


The simulation is run as follows. The number of classes is set to K = 3, both for data 
generation and for clustering, and we generate 20 observations per class, thus leading to 
n — 60 data in each scenario. Only q = 10 features are responsible for differences among 
classes. We simulate three different scenarios, where the data dimension p is respectively 
equal to 50,200,500: this means that in the less extreme situation 20% of the features are 
relevant, while in the more extreme one only 2% of them are relevant. Data are independently 
generated from Gaussian distributions such that, for i = 1,... ,n and j = 1 ,p, Xij ~ 
N(pij, a 2 ), with a = 0.2 and 


Pij 


j/p 


if j > q\ 


j/p+l.ho(l C2 {i)~lcz{i)) if j<q; 


where Ck, k = 1,..., K is the set of data belonging to the k -th cluster and Ia(-) is the 
indicator function of the set A. In order to compare the different partitions we use the 
Classification Error Rate (CER), which equals 0 if the partitions agree perfectly, and 1 in 


case of complete disagreement. Note that CER also equals 1 minus the Rand index (Rand 


1971). 


We aim at checking the adherence of the detected grouping to the real known classes: 


we expect both our sparse A'-means and the proposal in Witten and Tibshirani (2010) to 
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p = 50 p = 100 p = 500 

std 0.0474(0.0641) 0.121(0.0640) 0.213(0.0630) 

WT sparse 0.0277(0.0295) 0.0216(0.0219) 0.0307(0.0281) 

NEW sparse 0.0106(0.0106) 0.0118(0.0117) 0.0225(0.0273) 


Tabic 1: results of the simulation study of Section 4.1 Mean (standard devation) of the 
CER over 20 simulations. Along rows, results obtained with the three methods: standard 
A'-means (std), sparse A'-means of Witten and Tibshirani (2010) (WT sparse) and our 
implementation of sparse A'-means (NEW sparse). 
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Figure 1: results of the simulation study of Section 4.1 Plot of the weighting vector w 


obtained with the sparse A"-means of Witten and Tibshirani (2010) (red), and with our 
implementation of sparse A'-means (black), on two synthetic datasets with p = 50 (left) and 
p = 200 (right). 


be accurate in all situations, and to improve on standard A"-means if the proportion of 
redundant features is large. Results are shown in Table [lj we notice that both sparse 
methods perform well in all situations, while the performance of A'-means becomes worse 
when a large proportion of noisy irrelevant features is present. We also notice a slightly better 


performance of our implementation of sparse A'-means with respect to Witten and Tibshirani 


(2010). The slight superiority of our sparse approach might be explained in the light of a 
more direct impact of the sparsity parameter m on clustering results, with respect to the 
s parameter of Witten and Tibshirani (2010). In Figure [I] we compare the two methods in 
terms of the optimal weighting vector w, for two simulations with p = 50 (left) and p = 200 
(right): in our implementation of sparse A"-means, the optimal sparse parameter estimated 
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Set of synthetic data 



Figure 2: sample of synthetic functional data generated for the simulation study in Section 

via GAP statistics is m = 25 and m = 160 for the two simulations, respectively. This means 
that 50% and more than 75% of the features, respectively in the two cases, are correctly 
recognized as irrelevant and thus discarded from the classifier with a zero associated weight. 


The result obtained with the approach described in Witten and Tibshirani (2010), instead, 


shows that a positive small weight is given to all features, thus potentially including noisy 
confounders in the classifier. 

In conclusion, it is worth remarking that the two procedures are comparably accurate, 
since both correctly assign larger weights to the first 10 features, the ones that are responsible 
for differences in the signal distribution across classes. In our approach, though, the sparse 
parameter has a more straightforward interpretation in terms of the number of irrelevant 
features, which are immediately pointed out by a null weight. 


4.2 A simulation study for functional sparse clustering 

Aim of this second simulation study is to test our innovative A'-means implementation of 
functional sparse clustering. Results will be compared with standard functional A'-means. 

The simulation is run as follows. The number of classes is set to K = 2, both for data 
generation and for clustering, with 100 observations per class, thus leading to n = 200 data 
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Standard functional K-means 


True cluster means 



X 


X 


Figure 3: results of the simulation study of Section 4.2 True cluster mean functions (top, 
left); one of the synthetic datasets coloured according to the clusterization obtained with 
standard functional id-means (top, right) and with sparse functional AT-means (bottom, 
right); optimal weighting function computed by the sparse approach (bottom, left). The 
vertical line in the left plots is drawn for x — 1/2. 


in each scenario. The domain is chosen to be the interval D = [0,1]. The mean function of 
data belonging to the first cluster is fi(x) = (f)sin(&7nr) + a) ■ (a — Ax) + c, x E D, with 
a = 3, b = 2 and c = 0, while the second cluster has mean function 


f2{x) 


(bsm(bnx) + a) ■ (a — Ax) + c x G [0, |]; 

(bsm^bnx) + a) ■ (a — 4(1 — x)) — 2c(x — 1) x £ (|, 1], 


(18) 


with c = 1/2. The two true cluster mean functions are shown in Figure [3] (top-left panel): 
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run 1 2 3 4 5 6 7 8 9 10 

std 0.361 0.387 0.356 0.434 0.422 0.437 0.482 0.382 0.401 0.333 

sparse 0.0345 0.122 0.0567 0.113 0.0490 0.104 0.0502 0.0346 0.104 0.0611 


Table 2: results of the simulation study of Section 4.2 showing the CER values over 10 simu¬ 
lations. Along rows, results obtained with the two competing methods: standard functional 
A'-means (std), and sparse functional A'-means (sparse). 


note that they are nearly equal on the Erst half of the domain, the only difference being a 
vertical shift of 1/2. On the second portion of the domain, instead, they become increasingly 
different for values of the abscissa x closer to 1. Therefore we expect the weighting function 
to give more importance to the second half of the domain, with higher values when closer 
to 1, and to be zero in the Erst half. Data are generated according to f\(x) and f 2 {;x), for 
the first and second cluster respectively, with a rv_/ 1V(3,0.5 2 ) and b ~ A^(2,0.25 2 ). For data 
belonging to the first cluster, the mean function fi(x) is used with c ~ 1V(0,0.5 2 ), while 
for data belonging to the second cluster f 2 (x ) is used with c ~ 1V(0.5, 0.5 2 ). Note that a 
noise term has not been added to the synthetic data: indeed, one of the assumptions of 
our proposal is that the functional form of the data has already been reconstructed (i.e., no 
smoothing is needed). One of the sets of synthetic data analysed in this simulation study 
is shown in Figure [2} data are completely overlapped for x E [0,1/2], but cluster overlap is 
still consistent also for x > 1/2; a sparse clustering approach is thus needed. 

First, the CER index is used to check the adherence to the real known classes of the 
grouping structure detected by the two clustering strategies we are comparing: after 10 
repetitions of the simulation, a standard functional A'-means provides an average CER of 
0.3996, while sparse functional A'-means results in 0.07306, a reduction of more than 80%. 
The CER values for all simulations are shown in Table [2] in all simulated scenarios, the 
sparse approach manages to attain a far better adherence to the real underlying grouping 
structure. 

Secondly, we aim at extensively commenting the results of one simulation, to check the 
misclassification error of the two procedures, the estimated cluster mean functions, and in the 
case of sparse functional A'-means, to inspect the estimated weighting function w. Results 
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std 2-means 


clusters 

sparse 2-means 


clusters 
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2 
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2 

true 

1 

73 

27 

true 

1 

100 

0 


2 

33 

67 


2 

5 

95 


Tabic 3: results of the simulation study of Section 42 showing the confusion matrix of 
cluster assignments vs true labels. Left: results of standard functional 2-means clustering; 
right: results of sparse functional 2-means clustering. 


obtained in the fifth run of the simulation, whose CER value is shown in the fifth column 
of Table [2j are shown in Figure [3j As it can be appreciated from the top-right panel in 
the picture, the standard functional A-means does not provide a meaningful result: nearly 
one third of the curves are misclassihed, and this also reflects on the poor estimation of 
cluster mean functions shown in black in the plot (they can be compared with the true 
cluster means, shown in the top-left panel of the same figure). Results obtained with sparse 
functional K -means, instead, show a pretty nice cluster assignment, coherent with the true 
underlying grouping structure, and a very good estimation of the cluster mean functions. 

The confusion matrices of the two partitions obtained with standard and sparse functional 
A'-means with respect to the true labels are shown in Table [3] with standard A"-means, 30% 
of the data are misclassihed, while only 2.5% with sparse functional A^-means. 

Finally, let us consider the weighting function estimated by sparse functional A"-means, 
and shown in the bottom-left panel of Figure [3] the weighting function is non-zero only on 
the subinterval of the domain [0.521,1], and the optimal sparsity parameter is m = .521. This 
is obviously pointing out the fact that the true cluster mean functions differ mostly in the 
second half of the domain, thus detecting precisely the region of maximal cluster distinction. 
Moreover, w is a strict monotone increasing function on [0.521,1], thus also suggesting that 
data belonging to different clusters distinguish most for values of the abscissa x closer to 1, 
as it is evident from the top-left plot in Figure |3j 
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Misaligned growth curves Misaligned growth velocities Aligned growth velocities 



Figure 4: the Berkeley Growth Study dataset. From left to right: reconstructed growth 
curves, reconstructed growth velocities, and aligned growth velocities. 

5 Case Study: Berkeley Growth Data 

In this section we illustrate the results obtained with sparse functional clustering on the 
growth curves included in the Berkeley Growth Study, a benchmark dataset for functional 
data analysis which is also provided in the f da package (Ramsay and Wickham 2007) in R (|Rj 


Development Core Team, 2011). Aim of this Section is to check whether a sparse approach 


improves the insight on these data with respect to standard techniques, and whether the 
weighting function w helps in the interpretation of the results. Moreover, we aim at deepening 
the discussion on the role of the sparsity parameter, m, in the context of a real application. 


The Berkeley Growth Study (Tuddenham and Snyder, 1954) is one of several long-term 


developmental investigations on children conducted by the California Institute of Child Wel¬ 
fare. It includes the heights (in cm) of 93 children, 54 girls and 39 boys, measured quarterly 
from 1 to 2 years, annually from 2 to 8 years and then biannually from 8 to 18 years. It is 
reasonable to consider these data as discrete observations of a continuous process represent¬ 
ing the height of each single child, i.e. the child growth curve. We reconstruct the growth 


curves by means of monotonic cubic regression splines (Ramsay and Silverman, 2005), im¬ 


plemented using the R function smooth .monotone available in the fda package (Ramsay and 


Wickham 2007). Estimated growth curves and velocities are shown in Figure (left and 
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std 2-means 


clusters 

sparse 2-means 


clusters 



1 2 



1 2 

gender 

M 

37 2 

gender 

M 

37 2 


F 

9 45 


F 

9 45 


Tabic 4: results of the analysis of the Berkeley Growth Study data, showing the confu¬ 
sion matrix of cluster assignments vs gender. Left: results of standard functional 2-means 
clustering; right: results of sparse functional 2-means clustering 


central panels, respectively). 

We aim at comparing the results obtained via sparse functional /T-means with the ones 
obtained with standard functional il-means. As suggested in many papers analysing the 
same dataset, we will focus on growth velocities. From inspection of the central panel in 
Figure [4| we notice that all children show a similar growth pattern, characterized by the 
pubertal spurt, known in the medical literature as a sharp peak of growth velocity between 
10 and 16 years. However, the children follow their own biological clocks, thus resulting 
in a set of velocity curves showing the main growth spurt with different timing/duration. 
Moreover, a second minor feature emerges: some children also have a minor growth velocity 
peak between 2 and 5 years, called mid-spurt, which can be quite consistent but also absent. 
We would also like this minor feature not to be confounded by phase variability. Hence, we 
also analyse an aligned version of growth velocities, obtained by applying 1-mean alignment: 


this technique, first introduced in Sangalli et ah (2010), has been tested on the Berkeley 
Growth Study, thus showing that it effectively decouples amplitude and phase variability. 
Aligned growth velocities are also shown in Figure [4] (right panel). 

As it can be appreciated in Table [4j the results of sparse and standard functional 2-means 
of misaligned growth velocities are exactly the same: the two partitions coincide, and both 
estimate the gender classification quite well. Moreover, the weighting function estimated by 
sparse functional 2-means (Figure |5j left panel) mainly points out the pubertal peak period. 


All these conclusions are reasonable, and quite coherent with previous findings (Sangalli 


et ah, 2010; Chen et ah. 2014). But can we gain some further insight when analysing the 


aligned growth velocities sample, where phase variability has been properly removed? 
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Weighting function Weighting function of aligned data 




age aligned age 

Figure 5: results of the analysis of the Berkeley Growth Study data. Estimated weighting 
functions obtained by applying sparse functional 2-means to misaligned growth velocities 
(left) and to aligned ones (right). 

Interestingly, when analysing the aligned growth velocities with standard and sparse 
functional 2-means, we obtain two quite different grouping structures, none of which reflects 
gender stratification. The gender and standard 2-means classifications seem not to point out 
relevant features characterizing the clusters (see Figure [bj center and bottom). Instead, the 
2 clusters obtained by sparse functional 2-means distinguish among children having their 
mid-spurt, from children not having it (see Figure [bj top). This fact is even more evident 
if we look at the weighting function estimated by sparse functional 2-means when applied 
to aligned growth velocities, shown in Figure [5] (right panel): the period of the mid-spurt is 
there indicated as the most relevant portion of the domain. 

In conclusion, the sparse functional A'-means clustering of the aligned growth veloci¬ 
ties shows novel insight on the Berkeley Growth Study data, finally pointing out a data 
stratification possibly related to the children’s development in the early stages of their lives. 
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Sparse K-means, Cluster 1 


Sparse K-means, Cluster 2 





aligned age 


aligned age 


Figure 6: in grey in all panels, aligned growth velocities restricred to the subset of the domain 
where the weighting function estimated by sparse functional 2-means has higher values. In 
the top panels, curves are colored according to the clustering given by sparse functional 2- 
means. In the center panels, they are colored according to the result of standard functional 
2-means, while in the bottom panels they are colored according to gender (blue for males 
and pink for females). 
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6 Discussion 


In this paper we have first proposed a framework for sparse clustering of multivariate data 
based on hard thresholding, and proved its accuracy on simulated scenarios. We have then 
started from this novel definition of sparse clustering for multivariate data to properly define 
a quite general framework for sparse clustering of functional data. We have proven the 
existence and uniqueness of the optimal solution, and proposed a quite general and flexible 
algorithm to estimate it. Finally, we have tested the method on both simulated and real 
data. 

Given the high level of novelty and generality of the paper, there are many further 
directions arising from the present proposal which are worth being explored. Firstly, properly 
treating the possible misalignment in the data is an issue: the problem of decoupling phase 
and amplitude variability is often encountered in functional data analysis (Ramsay and 


Silverman 2005), and we indeed face it in the applied case study of Section [5] It the context 


of A'-means clustering it has been considered in Sangalli et al. (2010), where the authors prove 
the optimality of performing clustering and alignment jointly. It would be then interesting 
and relevant to develop a joint sparse functional clustering and alignment method, possibly 
merging the two approaches. 

Another interesting open issue concerns the functional measure: the variational problem 
© could have a different solution if another functional measure different from L 2 were to 
be considered, but we may then also not be able to prove results similar to Theorem [2j An 
interesting choice would be, for instance, the measure of the density generating the data. 
We would then have the following formulation for problem (JTT| 


max [ w(x)b(x, f ijk ,C)dF(X k (uj,x),x), (19) 

w(x),(Ci,...,C K ) J d 

where P(Afc(a j,x),x) is a probability measure depending both on the functional variables 
generating the data and the domain, while C = (Ci,..., Ck) is the partition. A related 
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possibility is to cluster the data jointly with their successive derivatives, if they exist. Again, 
the problem is proving the existence/uniqueness of the solution in the proper functional 
space. 

Another issue related to the variational problem concerns the optimization strategy: 
it would be interesting to define an iterative procedure converging to a global maximum. 
Moreover, if not solved in the previous point, it would be interesting also to try other 
clustering strategies. Finally, the tuning parameter m is also worth some further thoughts: 
the GAP statistics-inspired criterion that we are currently using is not really generalized but 
just adapted to the functional case, and surely something better can be developed. 
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A Appendix: alternative proof of Theorem [2 


We here give a completely variational proof of Theorem [2] 
Theorem 3. The variational problem: 


max / w(x)b(x)dx , (20) 

w£L 2 (D) J d 

subject to: \\w(x)\\l2(d) < 1, w(x) > 0 p — a.e. and p({x G D : w(x) = 0}) > m, 
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with D C M compact, 0 < m < n(D) < oo, b(x) continuous and non-negative a.e. and 
l_i a regular, non-atomic and translation-invariant measure has a solution given by w(x) = 
— Ib(x), where B = {x G D : b(x) > k} and k G M + depend on m. Moreover, the 
solution is [p]-a.e. unique if the function <f>(t) := p,({x G D : b(x) < t}) is continuous. 

Proof. Let us fix a k G M + , thanks to the continuity of b, such that g(B) = p(D) — m. It 


is clear that the function w(x) as defined above, satisfies the constraints in (20). We claim 


that w(x) is also the solution of (20) 


We will proceed in the proof deriving absurd statements if we assume that the maximizing 
function differs somehow from the proposed solution w(x). So, suppose w(x) is not the 


maximum for the functional in (20). Then, there exists another function g(x) that satisfies 
the constraints and 

/ g{x)b{x)dyL > / w(x)b(x)d/j,. (21) 

J D J D 

Define the set G := {x D \ g{x) > 0}. We will proceed firstly by showing that, if we 
suppose that G differs from B for a set of positive measure, we will always be able to find 
another function, more similar to the defined w(x) and satisfying the constraints, that gives 


us a higher value in the functional in (20). Then, given that G = B \fj] —a.e., if we suppose 


that (21) holds and g(x) ^ w(x) on a set of positive measure, again we obtain contradictory 
statements. 

So, let us observe the following considerations. 

B C G, with fJ>(B) < g,(G) cannot happen, because g{x) would not satisfy the measure 
constraint. 

It cannot be that G C B and g{G) < fi(B), because, for any g satisfying the constraints, 
we would have: 


g{x)b{x)dp, = / g(x)b(x)dpL 

' Jg 

< HsWIli^GjIlKUIk^o) 


( 22 ) 








< \\K X )\\ L2(G) 

< \\Kx)\\lHb) 

= / w(x)b(x)d/jL = / w(x)b(x)dfi. 


IB 


'D 


Now, suppose that the sets G and B are different, with a non-empty intersection with 
each other. Specifically, let us suppose that 


1. Gn5^0; 

2. Gn(d\5)^0; 

3. n(G fl B) > 0 and fi{G D (D \ B)) > 0; 

4. fi(G n B) + fi(G n (D\ B )) > 


This is the most complicated case, therefore we initially assume B \ (B fl G) contains an 
interval J. We want to show that, given those hypotheses on the sets B and G, no matter 
how g is, we can build a function g for which we reach a greater value in the functional 


in (20). For the sake of notation, let us set F G D (D \ B). T C D \ B, which is 
compact, because B is open in D. So, there exists an open finite subcover with intervals, 
(J ” =1 Ki, of D\ B. There exists at least one of the Kd s which has non empty intersection 
with T and that intersection is of positive measure, otherwise T itself is of measure zero, 
which would contradict the hypothesis. So, V C (J ig2 - K i: with Ic{l,...,n} and take a Jlj 
(or a suitable subinterval of it, which we still call it K t , with abuse of notation) such that 
0 < afi(Ki) < /i(r ft Ki) < n(J), with a suitable chosen 0 < a < 1. Choose a r such that 
the set Ki + r is contained in J. The set T := T fl Ki C K t is then translated in J and 
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fi(T + r) = fi(T) > 0. Setting (f>: T —> {T + r}, x H> x + t, we define a new function g 


9{x) = 


0 


if x G r 


( g(4> 1 (x))/f((/) x (x)) if x e f + r 
g(x) otherwise. 


( 23 ) 


It is easily seen that the constraints of the problem are satisfied and f D g(x)b(x)d/i < 
f D g(x)b(x)d/i, because Va; G B and Vy G D\ B, g(y)b(y ) < g(y)b(x). 

Now, let us suppose that the set B \ (B fl G) does not contain intervals. If there is a 
subset F in B \ (B fl G) and an interval E, such that y(EAF) = 0, we can adopt the same 
construction of the case above, with J replaced by E. 

Finally, we discuss the last case. B is open in D, therefore there exists a countable family 
of open intervals {O n } ne N such that B C an< ^ we f ur th er suppose that, for any n, 

the intersection B \ (B fl G) fl O n has never full measure. We can choose a suitable K t (as 
defined above), a O n (or suitable subintervals of those sets) and 0 < au, «2 < 1 , for which: 


0 < ai n(Ki) < /x(f) < n(Ki), 


(24) 


0 < a 2 fi(O n ) < n(B \(BHG)n O n ) < v(O n ), (25) 

and with 

n{Ki) < //(On). (26) 

Now, let us take a translation r mapping K t one-to-one and into O n and such that 
{f + r} fl (B \ (B fl G)) fl O n 7 ^ 0. If we can show that the set {f + r} fl (B \ (B fl G)) fl O n 
has positive measure, then we can re-define the map 0 : r fl {(B \ (B fl G) fl O n ) — r} —> 
{r + r} fl (B \ (B fl G)) n On and use the same construction of g to arrive to the same 
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conclusion. From (25) and the properties of measure, we have: 


M(r + r} n (5 \ (5 n G)) n o n ) > a 2/ r({r + r} n o n ) 


( 27 ) 


of 2 /x(r n {o n - r}) 


> a 2 n{r n Ki) 


> a 2 ain(Ki) > 0 


and so the claim follows. 

The previous discussion allows us to consider the case with G and B having empty 
intersection. We can use the same procedure used in the case above, with B \ (B D G) 
containing an interval. 

Therefore, we can assume that, necessarily, we must have /i(GAB) = 0. Now, given that 
assumption, we show that if g(x) ^ w(x) on a set of positive measure, g(x) cannot be the 
maximum for the functional in (20). 

Suppose that g>(GAB) = 0 and 3 U C B, such that g,(U) > 0 with 

{ w(x) if x G D \ U 

(28) 

w(x) + h(x) if x G U, 

with h(x) G L°°(U ) and h(x) > 0 a.e. Then, we would obtain: 



The case with 


9(x) 


w(x) 


if x G D \ U 


(29) 


w(x) — h(x) if x G U, 
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0 < h(x) < w(x)Iu(x), is clearly rejected since such a g{x) would not be the solution for 
Suppose, now, that n(GAB) = 0 and 3 U, V C B, such that n(U) > 0, n(V) > 0 with 


w(x) if x e D \ (U U V) 


9{x) 


w(x) + f{x) if x G U 


w{x) — h(x ) if x G V 


(30) 


with f{x), hix) G L°°iU) and fix), /i(x) > 0 a.e. The two functions / and h have to satisfy 
some constraints in order for g to be admissible. So, let us compute the L 2 norm of the 
defined g: 


Mx )\\ 2 


L 2 ( G) = / wix) 2 d/j,+ / iwix) +fix)) dfi 

Jd\{uuv ) Ju 

+ / iwfx) — hix)) 2 dfi. 

Jv 


Rearranging the integrals and substituting the definition of w, we obtain 


\\ 9 (x)\\l HG) 


= i + \\f(x)\\h(m + \\h{x)\\hon 

+ WK X )W L\B) (X - Jy h ( X ) h ( X ) d d^ ■ 


For g to be admissible, then, we must have: 


(31) 


\\m\\h ( u) + \\H^\\h ( v)+ ( 32 ) 

Fwkh ( L f{x)b{xW ' 1 = °- 

which leads to: 

(^ j f ix)bix)dfj, — J hix)bix)dfj^J = (33) 
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(ll/MIlW IIM*)lll> ( v,) 


\\b{x)\\ L 2 {B) 

2 


Note that the second member is negative, 
such a g, we have 


Now, if we evaluate the functional in (20) with 


g(x)b(x)d/j, = / w(x)b(x)dp + / f(x)b(x)dg — / h(x)b{x)dg ; 


id 


' B 


'U 


'V 


= \\b(x)\\ L 2 (B) - 


l|6WlliI<B) 'll/WllV) + l|fcWIII W 


< WH x )Wl\b) = / w(x)b(x)dn- 

J D 


(34) 


Thus g cannot be the maximum. We therefore conclude that w(x), as we defined it, has to 
be the solution to the problem. □ 
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